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COMPUTATIONAL NEUROSCIENCE 

Efficient coding of spectrotemporal binaural sounds leads 
to emergence of the auditory space representation 

Wiktor Mtynarski * 

Max-Planck Institute for Mathematics in the Sciences, Leipzig, Germany 

To date a number of studies have shown that receptive field shapes of early sensory 
neurons can be reproduced by optimizing coding efficiency of natural stimulus ensembles. 
A still unresolved question is whether the efficient coding hypothesis explains formation 
of neurons which explicitly represent environmental features of different functional 
importance. This paper proposes that the spatial selectivity of higher auditory neurons 
emerges as a direct consequence of learning efficient codes for natural binaural 
sounds. Firstly, it is demonstrated that a linear efficient coding transform — Independent 
Component Analysis (ICA) trained on spectrograms of naturalistic simulated binaural 
sounds extracts spatial information present in the signal. A simple hierarchical ICA 
extension allowing for decoding of sound position is proposed. Furthermore, it is shown 
that units revealing spatial selectivity can be learned from a binaural recording of a natural 
auditory scene. In both cases a relatively small subpopulation of learned spectrogram 
features suffices to perform accurate sound localization. Representation of the auditory 
space is therefore learned in a purely unsupervised way by maximizing the coding 
efficiency and without any task-specific constraints. This results imply that efficient coding 
is a useful strategy for learning structures which allow for making behaviorally vital 
inferences about the environment. 

Keywords: efficient coding, natural sound statistics, binaural hearing, spectrotemporal receptive fields, auditory 
scene analysis 
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1. INTRODUCTION 

As originally proposed by Barlow (1961), the efficient coding 
hypothesis suggests that sensory systems adapt to the statisti- 
cal structure of the natural environment in order to maximize 
the amount of conveyed information. This implies that stimu- 
lus patterns encoded by sensory neurons should reflect statistics 
and redundancies present in the natural stimuli. Indeed - it has 
been demonstrated that learning efficient codes of natural images 
(Olshausen and Field, 1996; Bell and Sejnowski, 1997) or sounds 
(Lewicki, 2002) reproduces shapes of neural receptive fields in 
the visual cortex and the auditory periphery. Additionally, recent 
studies provided statistical evidence suggesting that spectrotem- 
poral receptive fields (STRFs) at processing stages beyond the 
cochlea, are adapted to the statistics of the auditory environment 
(Klein et al., 2003; Carlson et al., 2012; Terashima and Okada, 
2012) to provide its efficient representation. However, having a 
sole representation of the stimulus is not enough for the organ- 
ism to interact with the environment. In order to perform actions, 
the nervous system has to extract relevant information from the 
raw sensory data and then segregate it according to its functional 
meaning. For example the auditory system must extract posi- 
tion invariant information regardless of sound quality, separating 
"what" and "where" information. In a more recent paper (Barlow, 
2001) Barlow proposed that behaviorally relevant stimulus fea- 
tures (i.e., ones supporting informed decisions) may be learned 
by redundancy reduction. In other words, functional segregation 



of neurons can be achieved by efficient coding of sensory inputs. 
The evidence in support of this notion is still sparse. 

Among different sensory mechanisms, spatial hearing provides 
a good example for the extraction and separation of behaviorally 
vital information from the sensory signal. The ability to localize 
and track sound sources in space is of a critical relevance for the 
survival of most animal species. Contrary to vision, audition cov- 
ers the entire space surrounding the listener and may therefore 
provide an early warning about the presence or motion of objects 
in the environment. Mammals localize sounds on the azimuthal 
plane using two ears (Schnupp and Carr, 2009; Grothe et al., 
2010). Binaural hearing mechanisms rely on between-ear dispar- 
ities to infer the spatial position of the sound source. In humans, 
according to the well known Duplex Theory (Strutt, 1907; Grothe 
et al., 2010), interaural time differences (ITDs) constitute a major 
cue for low frequency sound localization and sounds of high fre- 
quency (> 1500 Hz) are localized with interaural level differences 
(ILDs). However in natural hearing conditions, spectrotempo- 
ral properties of sounds vary continuously, hence combinations 
of cues available to the organism also change. Even though 
temporal differences on the order of microseconds are of a sub- 
stantial importance for sound localization, binaural neurons in 
the higher areas of the auditory pathway can be characterized with 
Spectrotemporal Receptive Fields (STRFS), which have much 
more coarse temporal resolution (ms) (Gill et al., 2006). Despite 
such loss of temporal accuracy, many of those neurons reveal 
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FIGURE 1 | Data generation process. (A) Interpretation of consecutive 
stages of data generation. The acoustic environment is either simulated (B) 
or recorded with binaural microphones (C). Further stages of the 
processing include frequency decomposition and transformation with a 
logarithmic nonlinearity, which emulates cochlear filtering (D) Positions of 
HRTFs around the head are marked with circles. 



sharp spatial selectivity (Schnupp et al., 2001) encoding the posi- 
tion of the sound source in space. What is the neural computation 
underlying this process remains an open question. 

The present paper uses spatial hearing as an example of a 
sensory task, to show how information of different meaning 
("what" and "where") can be clearly separated. As its main result, 
it provides computational evidence pointing that redundancy 
reduction leads to the separation of spatial information from 
the representation of the sound spectrogram. This means that 
formation of the neural auditory space representation can be 
achieved without the need of any task-specific computations but 
solely by applying the general principle of redundancy reduc- 
tion. It is demonstrated that Independent Component Analysis 
(ICA) - a linear efficient coding transform (Hyvarinen et al., 
2009) trained on a dataset of spectrograms of simulated as well as 
natural binaural speech sounds, extracts sound position invari- 
ant features separating them from the representation of the 
sound position itself. Learned structures can be understood as 
model spatial and spectrotemporal receptive fields of auditory 
neurons which encode different kinds of behaviorally relevant 
information. Current results are in line with known physiological 
phenomena and allow to make new experimental predictions. 

2. MATERIALS AND METHODS 

High order statistics of natural auditory signal were studied by 
performing ICA on a time-frequency representation of binaural 
sounds. 

As a proxy for natural sounds, speech was used in the present 
study. Speech comprises a rich variety of acoustic structures and 
has been successfully used to learn statistical models predicting 
properties of the auditory system (Klein et al, 2003; Smith and 
Lewicki, 2006; Carlson et al., 2012). Additionally, it has been sug- 
gested that speech may have evolved to match existing neural 
representations, which are optimizing information transmission 
of environmental sounds (Smith and Lewicki, 2006). 

Spatial sounds were obtained in two ways. Firstly, the effi- 
cient coding algorithm was trained using simulated naturalistic 
binaural sounds. Simulation gave the advantage of labeling each 
sound with its spatial position. Secondly a natural auditory scene 
was recorded with binaural microphones. The signal obtained in 
this way was less controlled, however it contained more com- 
plex and fully natural spatial information. Training datasets were 
obtained by drawing 70000 random intervals 216 ms long from 
each dataset separately. The data generation process together with 
its interpretation is displayed on Figure 1. 

2.1. SIMULATED SOUNDS 

As a corpus of natural sounds, data from the International 
Phonetic Association Handbook (International Phonetic 
Association, 1999) were used. The database contains speech 
sounds of a narrative told by male and female speakers in 29 
languages. All sounds were downsampled to 16000 Hz from their 
original sampling rate and bandpass filtered between 200 and 
6000 Hz. The training dataset was created by drawing random 
intervals of 216 ms from the speech corpus data. Spatial sounds 
were simulated by convolving sampled speech chunks with 
human Head Related Transfer Functions (HRTFs). HRTF fully 



describe the sound distortion due to the filtering by the pinnae 
and therefore contain entire spatial information available to the 
organism. Given an angular sound source position 6, HRTF is 
defined by a pair of linear filters: 

HRTF(Q) = {h L ,Q(t), h Rfi (t)} (1) 

where L, R subscrpits denote left and right ear respectively, and 
f denotes time sample. One should note that in the temporal 
domain, HRTFs are often called Head Related Impulse Response 
(HRIR). A set of HRTFs was taken from the LISTEN database 
(Warfusel, 2002). The database contains human HRTFs recorded 
for 187 positions in the three-dimensional space surrounding the 
subject's head. HRTFs from a single random subject were selected 
and further limited to positions lying on the azimuthal plane with 
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15° spacing (24 positions in total). Monaural stimulus vectors 
XE(t) (-Eg {-R, L] denotes the ear) were created by drawing ran- 
dom chunks g(t) of speech sounds and convolving them with 
HRTF(Q) corresponding to an azimuthal position 9, which was 
also randomly drawn: 

/oo 
h E (x)g(t - x)dx (2) 
-oo 

where * denotes the convolution operator. In this data, spatial and 
identity information constitute independent factors. 

2.2. NATURAL SOUNDS 

In order to obtain a dataset of natural binaural sounds a com- 
plex auditory scene was recorded using binaural microphones. 
The recording consisted of three people (two males and one 
female) engaged in a conversation while moving freely in an 
echo-free chamber. Such an environment without reflections and 
echoes reduced the number of factors modifying sound wave- 
forms. One of the male speakers was recording the audio signal 
with Soundman OKM-II binaural microphones placed in the 
ear channels. In total 20 min were recorded and included mov- 
ing and stationary, often overlapping sound sources. To test the 
spatial sensitivity of learned features a recording with a single 
male speaker was performed. He walked around the head of the 
recording subject with a constant speed following a circular tra- 
jectory while reading a book out loud, twice in the clockwise and 
twice anti-clockwise direction. The length of the testing dataset 
was 54 s. 

2.3. SIMULATED COCHLEAR PREPROCESSING 

Before reaching the auditory cortex, where spatial receptive fields 
(SRFs) were observed (Schnupp et al, 2001), sound waveforms 
undergo a substantial processing. Since the modeling focus of the 
present study was beyond the auditory periphery, the data were 
preprocessed to roughly emulate the cochlear filtering (see the 
scheme on Figure 1). 

Short Time Fourier Transform (STFT) was performed on each 
sound interval included in the training dataset. Each chunk was 
divided into 25 overlapping windows each 16 ms long. STFT 
spanned 256 frequency channels logarithmically spaced between 
200 and 4000 Hz (decomposition into arbitrary, non-linearly 
spaced frequency channels was computed using the Goertzel 
algorithm). Logarithmic frequency spacing was observed in the 
mammalian cochlea and seems to be a robust property across 
species (Greenwood, 1990; Smith et al., 2002). The spectral power 
of the resulting spectrograms was transformed with a logarithmic 
function which emulates the cochlear compressive non-linearity 
(Robles and Ruggero, 2001). 

Stimuli were 216 ms long in order to match the temporal 
extent of cortical neurons' STRFs, which were characterized by 
SRFs (Schnupp et al., 2001). Besides emulating the cochlear trans- 
formation of the air pressure waveform, such spectrograms were 
reminiscent of the sound representation most effective in map- 
ping spectrotemporal receptive fields in the songbird midbrain 
(Gill et al., 2006). A very similar representation was used in a 
recent sparse coding study (Carlson et al., 2012). 



Spectrograms of left and right ears were concatenated. Such 
data representation attempts to simulate the input to higher bin- 
aural neurons, which operate on spectrotemporal information, 
simulteneously fed from monaural channels (Schnupp et al., 
2001; Miller et al, 2002; Qiu et al, 2003). In principle, we could 
first train ICA on monaural spectrograms and then model their 
codependencies. In such way, however, the algorithm could not 
explicitly model binaural correlations. Additionally, this would 
require application of a hierarchical model, which lies outside 
of the scope of this study. Our approach resembles ICA stud- 
ies, which focused on modeling of visual binocular receptive 
fields (Hoyer and Hyvarinen, 2000; Hunt et al, 2013). There, the 
input to binocular neurons in the visual cortex was modeled by 
concatenating image patches from the left and the right eye. 

The efficient coding algorithm was run on the resulting time- 
frequency representation of the binaural waveforms. After pre- 
processing the dimensionality of data vectors was equal to 2 x 
(25 x 256) = 12800. Both training datasets: simulated and nat- 
ural one consisted of 70000 samples. Prior to the ICA learning, 
the data dimensionality was reduced with Principal Component 
Analysis (PCA) to 324 dimensions, preserving more than 99% of 
total variance in both cases. Due to memory issues (allocation of a 
very large covariance matrix) a probabilistic PCA implementation 
was used (Roweis, 1998). 

2.4. INDEPENDENT COMPONENT ANALYSIS 

ICA is a family of algorithms which attempt to find a maximally 
non-redundant, information-preserving representation of the 
training data within the limits of the linear transform (Hyvarinen 
et al., 2009). In its standard version, given the data matrix X e 
gnxm (whgj-g n j s me number of data dimensions and m the num- 
ber of samples), ICA learns a filter matrix W e K" x ", such that: 

WX = S (3) 

where columns of X are data vectors x e W, rows of W are lin- 
ear filters w € R" and S G W xm is a matrix of latent coefficients. 
Equivalently the model can be defined using a basis function 
matrix A = W , such that: 

X = AS (4) 

Columns a e R" of matrix A are known as basis functions and 
in neural systems modeling can be interpreted as linear stimulus 
features represented by neurons, which form an efficient code of 
the training data ensemble (Hyvarinen et al., 2009). Linear coef- 
ficients s are in turn a statistical analogy of the neuronal activity. 
Since they can take both positive and negative values, their direct 
interpretation as physiological quantities such as firing rates is not 
trivial. For a discussion of relationship between linear coefficients 
and neuronal activity, please refer to Rehn and Sommer (2007); 
Hyvarinen et al. (2009). Equation 4 implies that each data vector 
can be represented as a linear combination of basis functions a 

x(t) = W(t) (5) 
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where t indexes the data dimensions. The set of basis functions 
a is called a dictionary. ICA attempts to learn a maximally non- 
redundant code. For this reason latent coefficients s are assumed 
to be statistically independent i.e., 



(6) 



The marginal probability distributions p(s;) are usually assumed 
to be sparse (i.e., of high kurtosis), since natural sounds and 
images have intrinsically sparse structure (Olshausen and Field, 
1997) and can be represented as a combination of a small num- 
ber of primitives. In the present work logistic distribution of the 
form: 

expt-^) 
^(l+expC-^)) 2 

with position |x = 0 and scale parameter i= = 1 is assumed. The 
basis functions are learned by maximizing the log-likelihood of 
the model via gradient ascent (Hyvarinen et al, 2009). 

2.5. ANALYSIS OF LEARNED BASIS FUNCTIONS 

Similarity between left and right ear parts of learned basis func- 
tions was assessed using the Binaural Similarity Index (BSI), as 
proposed in Miller et al. (2002). The BSI is simply Pearson's cor- 
relation coefficient between left and right ear parts of each basis 
function. BSI equal to —1 means that absolute values at every fre- 
quency and time position are equal and have the opposite sign, 
while BSI equal to 1 means that the basis function represents the 
same information in both ears. 

Dictionary of binaural basis functions learned from natu- 
ral data was classified according to the modulation spectra of 
their left ear parts. A modulation spectrum is a two-dimensional 
Fourier transform of a spectrogram. It is informative about spec- 
tral and temporal modulation of learned features and it has 
been applied to study properties of natural sounds (Singh and 
Theunissen, 2003) and real (Miller et al., 2002) as well as modeled 
(Saxe et al., 2011) receptive fields in the auditory system. 

Spatial sensitivity of basis functions learned from natural data 
was further quantified by means of Fisher information. Fisher 
information is a measure of how accurate one can estimate a 
hidden parameter G from an observable s knowing a conditional 
probability distribution p(s\Q) (Brunei and Nadal, 1998). Here, 
0 corresponds to the angular position of the auditory stimulus 
and s to one of the sparse coefficients. Assuming a deterministic 
mapping s(0) = /(6) = |i,e distorted with a zero-mean stationary 
Gaussian noise, one obtains: 



p(s|9)=^(5|^e,cr) 



(8) 



For simplicity cr was assumed to be equal to 1 . Fisher information 
1(9) then becomes (Brunei and Nadal, 1998): 



(9) 



head of the subject. Each activation time course was additionally 
smoothed with a 20 samples long rectangular window. 

3. RESULTS 

Besides the properties of the sound source itself, natural sounds 
reaching the ear membrane are also shaped by head-related 
filtering. The spectrotemporal structure imposed by the filter 
depends on the spatial configuration of objects. By performing 
redundancy reduction the auditory system could, in principle, 
separate those two sources of variability in the data and extract 
spatial information. One should observe that transformations 
performed by the cochlea can strongly facilitate this task. The 
stimulus xe (where E e L, R indicates the left or the right ear) 
is an air pressure waveform g(t) convoluted with an HRTF (or 
a combination of HRTFs) /i£,e(f)> as defined by equation 2. The 
basilar membrane performs frequency decomposition, emulated 
here by the Fourier transform: 

/oo 
XE(t) exp{—2nia>t)dt 
-oo 

= A*(cos*f, u) + isincf J £ i J (10) 

where co denotes frequency, A\ m amplitude and 4>| m phase. By 
the convolution theorem (Katznelson, 2004), convolution in the 
temporal domain is equivalent to a pointwise product in the 
frequency domain, i.e., 



J — c 



F((g * hg), w) = / g(f) rap(— 2jn'cof) dt / liE(t) exp(— 2niti>t)dt 
= A*(cos(t4 + ism^A^Ccos^ + isin^J 

Additionally, the basilar membrane applies a compressive non- 
linearity (Robles and Ruggero, 2001) which this study approx- 
imates by transforming the spectral power with a logarithmic 
function. Since the logarithm of the product is equal to the sum 
of logarithms, the spectral amplitude of the stimulus Ag = 
A\ M Af, can be decomposed into the sum: 

l0g(4aX») = l°g(4, J + !°g( A ») 



(ID 



Mean values 0,9 were estimated by averaging coefficient activa- 
tions over four trials during which the speaker walked around the 



This means that the spectrotemporal representation of the signal 
generated by the cochlea is a sum of the raw sound and HRTF fea- 
tures. One should note, however, that the above analysis applies 
to an infnite window Fourier transform, and the data used in this 
study was generated by performing a STFT with a 16 ms long, 
overlapping windows. Fourier coefficients were mixed between 
neighboring windows due to their overlap. For point-source, sta- 
tionary sounds this effect did not influence the log(A^ ) term 
of the Equation 11, since HRTFs were shorter than the STFT 
window, hence hear-related filtering was temporally constant. 
For a dynamic scene, where neighboring STFT windows con- 
tained different spatial information, the additive separability of 
sound and HRTF features (as described by Equation 11) may 
have been distorted. Taken together, a linear redundancy reducing 
transform such as ICA provides a reasonable approach to sepa- 
rate information about object positions from the raw sound. In 
an ideal case, ICA trained on stimulus spectrograms A* could 
separate representation of HRTF {A\ ) and stimulus (A| ) 
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amplitudes into two distinct basis functions sets (Harper and 
Olshausen, 201 1). The difficulty of the separation task depends on 
the temporal variability of the spatial information which reflects 
configuration of the environment (i.e., number of sources, their 
motion patterns and positions). The current study considers two 
cases of different complexity: (a) simulated dataset consisting of 
short periods of speech displayed from single positions and (b) a 
binaural recording of a natural scene with freely moving human 
speakers. 

3.1. SIMULATED SOUNDS 

The goal of the present study was to identify high-order statis- 
tics of natural sounds informative about positions of the sound 
source. Association of a sound waveform with its spatial position 
requires detailed knowledge about source localization i.e., each 
sound should be labeled with spatial coordinates of its source. 
For this reason binaural sounds studied in this section were sim- 
ulated, using speech sounds and human HRTFs. Naturalistic data 
created in this way resembled binaural input from the natural 
environment, while making position labeling of sources available. 

From the simulated dataset, after reducing data dimensionality 
with PCA (see section 2.3), 324 ICA basis functions were learned. 
A subset of 100 features is depicted in Figure 2. It is clearly visible 
that the learned basis can be divided into two separate subpop- 
ulations by the similarity between their left and right ear parts, 
which is quantified by the Binaural Similarity Index (BSI) (see 



Materials and Methods). Sorted values of the BSI are displayed on 
Figure 6A as black circles. The majority of basis functions (314) 
exceed the 0.9 threshold and only 10 fall below it. Out of those 8 
reveal strong negative interaural correlation and only 2 are close 
to 0. Basis functions with the BSI below 0.9, were separated from 
the rest and all ten of them are depicted on Figure 2A. Since they 
represent different information in each ear they are going to be 
called "binaural" through the rest of the paper. This is in con- 
trast to "monaural" basis functions which encode similar sound 
features in both ears (see Figure 2B). 

The binaural sub-dictionary captures signal variability present 
due to the head-related filtering. Even though the training dataset 
included sounds displayed from 24 positions, hence 24 different 
HRTFs were used, only 10 binaural basis functions emerged from 
the ICA. Out of those, almost all are temporally stable i.e., do not 
reveal any temporal modulation (except for 2 - positions 5 and 6 
on Figure 2 A). The dominance of temporally constant features 
was expected, since training sounds were displayed from fixed 
positions and were convoluted with filters, which did not change 
in time. Temporally stable basis functions weight spectral power 
across frequency channels, mostly with opposite sign in both ears 
(as reflected by negative values of the BSI). Surprisingly, despite 
the lack of moving sounds in the training dataset, two tempo- 
rally modulated basis functions were also learned by the model. 
They represent envelope comodulation in high frequencies with 
an interaural phase shift of it radians. 

A representative subset of 90 monaural basis functions is 
depicted on Figure 2B. Their left and right ear parts are exactly 
the same and encode a variety of speech features. Regularities 
such as harmonic stacks, on- and offsets or formants are visible. 
Captured monaural patterns essentially reproduce results from a 
recent study by Carlson et al. (2012) which shows that efficient 
coding of speech spectrograms learns features similar to STRFs in 
the Inferior Colliculus (IC). Monaural basis functions are, how- 
ever, not a focus of the present study and are not going to be 
discussed in detail. 

A separation of the learned dictionary into two subpopu- 
lations of binaural and monaural basis functions (a g and a k 
respectively) allows to represent every sound spectrogram in the 
training dataset as a linear combination of two isolated factors i.e., 
representations of speech and HRTF structures (see Figure 2C). 
Taking this fact into account, Equation 5 can be rewritten as: 

G K 
x(f) + (12) 

i=l ;=1 

This notation explicitly decomposes the basis into G spatial basis 
functions a g and K and non-spatial basis functions a k . 

3. 1. 1. Emergence of model spatial receptive fields 

Marginal coefficient histograms conformed rather well to the 
logistic distribution assumed by the ICA model, although bin- 
aural coefficients were typically more sparse (see Figure 3). In 
order to understand how informative learned features are about 
position of sound sources, conditional distributions of the linear 
coefficients were studied. Histograms conditioned on a location 
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FIGURE 2 | ICA basis functions trained on simulated sounds (A) 
Binaural basis functions a?. Left and right ear parts are dissimilar. (B) 
Monaural basis functions a{\ Left and right ear parts are highly similar. (C) 
Explanation of the representation. Each stimulus can be decomposed into i 
linear combination of monaural basis functions (multiplied by their 
coefficients s^) and binaural ones multiplied by coefficients sf. 
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of a sound source reveal whether any spatial information is 
encoded by learned basis functions. 

Figures 3A-F displays 6 basis functions and corresponding 
conditional histograms. The horizontal axis of each conditional 
histogram corresponds to the angular position of the sound 
source (from 0 to 345°). A vertical cross-section is a normalized 
histogram of the coefficient values for all sounds displayed in the 
training dataset from a particular position (around 2900 samples 
on average). 

Three representative monaural basis functions are depicted on 
Figures 3D-F. It is immediately visible that conditional distribu- 
tions of their coefficients are stationary across spatial positions. 
The zero-centered logistic pdf with a constant scale parameter 
(parameters equal to those of the marginal pdf) is preserved 
across all positions. This implies that coefficients of monaural 
basis functions are independent from the sound source location. 
Monaural bases encode speech features and since all speech struc- 
tures were displayed from all positions in the training data, their 
activations do not carry spatial information. This property is 
characteristic for all basis functions with BSI greater than 0.9. 

Coefficients of binaural basis functions reveal a very differ- 
ent dependency structure (see Figures 3A-C. Their variance at 
each spatial position is very low, however, variability across posi- 
tions is much higher. Activations of binaural features remain close 



to zero at most angular positions regardless of the sound iden- 
tity. At few preferred positions they reveal pronounced peaks 
in activation (positive or negative) reflected by strong shifts in 
the mean value. This highly non-stationary structure of condi- 
tional pdfs is informative about the sound position, while remains 
almost invariant to the sound's identity (which is reflected by the 
small standard deviation). Basis function depicted on Figure 3A 
responds with a strong positive activation to sounds originating at 
270° (i.e., directly in front of the right ear) and with a strong neg- 
ative activation to sounds originating from the directly opposite 
location - at 90° (i.e., in front of the left ear). Sounds at posi- 
tions deviating ±15° from peaks also modulate basis activations, 
although activations are weaker. Similar spatial selectivity pattern 
is revealed by the basis function on Figure 3C, which however 
responds positively to sounds at 60 and negatively to sounds at 
315°. The spectrotemporal feature on Figure 3B encodes spatial 
information of a particularly high behavioral relevance. Its activ- 
ity significantly deviates from zero, only when sounds are placed 
behind the head in the interval between 165 and 210° . This region 
is not visually accessible, therefore position or motion of objects 
in that area has to be inferred basing on auditory information 
only. It may appear that conditional histograms are symmetric 
around the 180° point. However, positive and negative peaks of 
coefficient histograms do not have exactly equal absolute values. 
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It is important to notice here that each spectrotemporal feature 
captured by binaural basis functions is an indirect representation 
of the sound position in the surrounding environment. Therefore 
if ICA basis functions can be interpreted as STRFs of binaural 
neurons, the corresponding conditional histograms constitute a 
theoretical analogy of their SRFs informing the organism about 
the position of the sound source within the head-centered frame 
of reference. 

3. 1.2. Decoding of the sound position 

As described in the previous subsection, linear coefficients of bin- 
aural basis functions are informative about the location of the 
sound source. Spatial selectivity of single basis functions is how- 
ever not specific enough to reliably localize sounds. Pairwise coef- 
ficient activations of two exemplary basis functions are depicted 
on Figure 3G. Each point represents a single sound and its color 
corresponds to the source's angular position. Strong clustering of 
same-colored points is strongly visible. They form at least 6 highly 
separable clusters. This, in turn, shows that the joint distribution 
of those two coefficients contains more information about the 
source position than one dimensional conditional pdfs. This is 
in contrast to Figure 3H depicting co-activations of two monau- 
ral basis functions. There, points of all colors are strongly mixed, 
creating a "salt and pepper" pattern, where no clear separation 
between source positions is visible. 

To test, whether reliable decoding of sound position from acti- 
vations of binaural basis functions is possible, this work employs 
the Gaussian Mixture Model (GMM). The GMM models the 
marginal distribution of latent coefficients s g used for the posi- 
tion decoding as a linear combination of Gaussian distributions, 
such that: 

24 

p(s*) = 2>(s*ia)p(Q) (13) 
k=l 

p(s?\C k )=M(£\ i i k ,D k ) (14) 

where Q is a position label (Ci = 0 deg, C24 = 345 deg) and 
[i k , D k denote a position specific mean vector and covariance 
matrix respectively. The structure of dependencies among ran- 
dom variables is presented in a graphical form in Figure 4A. Since 
the prior on position labels p(Q) is assumed to be uniform, 
the decoding procedure can be recast as a maximum-likelihood 
estimation: 

C = argmaxp(/|Q) (15) 

k 

where C is the decoded position. The resulting procedure iterates 
over all position labels and returns the one which maximizes the 
probability of an observed data sample. 

The decoding performance relies on the selected subset of basis 
functions used for this task. To test whether binaural features con- 
tribute stronger to the position decoding than monaural ones, 
all basis functions were sorted according to their BSI. Then, the 
GMM was trained using incrementally larger number of latent 
coefficients, starting from a single one corresponding to the basis 
function with the highly negative BSI and ending using the entire 
basis function set. In every step, for the GMM training 70% of 
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FIGURE 4 I Position decoding model. (A) A graphical model representing 
variable dependencies. (B) Decoders performance plotted against the 
number of used basis functions. Vertical dashed line separates binaural 
basis functions from monaural ones. 



the data were used, while remaining 30% were used for cross- 
validation. The average decoder performance is plotted against 
the number of used features on Figure 4B. Binaural features 
are separated from the monaural ones with a dashed vertical 
line. A straightforward observation is that binaural basis func- 
tions almost saturate the decoding accuracy. Indeed it reaches 
the level of 97.9%. Adding remaining 314 monaural basis func- 
tions increases the performance to 99.7% which is only 1.8% 
point. Interestingly, temporally modulated binaural basis func- 
tions number 5 and 6 did not contribute to the decoding quality, 
which is visible as a short plateau on the plot. Saturation of 
the decoder's performance by binaural basis function activations 
entails that almost entire spatial information present in the sound 
is separated from other kinds of information by the ICA model 
and represented by binaural basis functions. Relating this obser- 
vation to the nervous system, this means, that the spatial position 
of natural sound sources can be decoded from the joint activity of 
a relatively small subpopulation of binaural neurons. 

3.2. NATURAL SOUNDS 

The previous section described results for simulated sounds. 
While simulated sounds have the advantage of giving a full control 
over source positions they are only a very crude approximation to 
the binaural stimuli occurring in the real natural environment. 
This section describes results obtained using binaural record- 
ings of a real-world auditory scene, consisting of three speakers 
moving freely in an echo-free environment. 

Binaurality of learned basis functions was again quantified 
with the BSI. Sorted BSI values are plotted on Figure 6A as gray 
triangles. A strong difference is visible, when compared with val- 
ues of the dictionary trained on simulated data (black circles). 
Firstly, 64 natural basis functions lay below the 0.9 threshold - 
many more compared to only 10 simulated ones. Secondly, natu- 
ral BSIs vary more smoothly, and are more uniformly distributed 
between —1 and 0.9 (see the histogram displayed in the inset). 

Similarly to the previous case, the learned dictionary was 
divided into two sub-dictionaries - binaural ones - below 
and monaural ones - above the 0.9 BSI threshold. The sub- 
dictionary consisting of binaural basis functions is displayed 
on Figures 5A,B displays 40 exemplary monaural basis func- 
tions. While no qualitative difference is visible between monaural 
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FIGURE 5 | Basis functions learned using natural data. (A) Binaural basis 
functions (60 out of 64). (B) Monaural basis functions (40 out of 250). 



features when compared with results from the previous section 
(Figure 2B), the binaural sub-dictionaries differ strongly. Basis 
functions trained using natural data, reveal much richer variety 
of shapes including temporally modulated ones along patterns of 
strong spectral modulation. 

3.2. 1. Properties of the learned representation 

This subsection presents properties of binaural basis functions 
trained with the natural binaural data. They were studied in more 
detail than the dictionary learned from simulated data since its 
structure is more complex and may reflect better the properties of 
binaural neurons. One should note that in neural systems model- 
ing, neural receptive fields correspond better to ICA filters (rows 
w of matrix W in Equation 12). Basis functions, however, consti- 
tute optimal stimuli i.e., given basis function a, as input the only 
non-zero coefficient is going to be s, . Additionally, basis functions 
are a low-passed version of filters (Hyvarinen et al., 2009), and are 
more appropriate for plotting, since they represent actual parts of 
stimulus. For those reasons, this study focuses on basis function 
statistics. 

The binaural dissimilarity of learned features was assessed with 
two measures. The BSI provides a continuous value quantifying 
how well the left ear part matches the right ear part. It however 
does not take into account the dominance of one ear over another. 
The dominance can be measured by comparing monaural peaks 
i.e., points of the maximal absolute value of left and right ear 
parts. Both measures were used by Miller et al. (2002) to describe 
receptive fields of binaural neurons in the auditory thalamus and 
cortex. Monaural peaks (measured in standard deviation of the 
basis function dimensions) are compared on Figure 6B. Crosses 
mark basis functions with the positive and diamonds with the 
negative BSI. Symbol sizes correspond to the absolute BSI value. 



Basis functions cluster along the diagonals (marked with dashed 
lines) which means that left and right ear peaks have similar abso- 
lute values and no clear dominance of a single ear is present. 
Interestingly, while roughly the same number of basis functions 
lays in upper right and both lower quadrants, only 4 lay in the 
upper left one, corresponding to basis functions with a negative 
peak in the left ear and positive in the right ear. Unfortunately, 
direct comparison of the analysis on Figure 6B with Figure 9 in 
Miller et al. (2002) is not possible, due to the arbitrariness of the 
sign in the ICA model (coefficients can have positive and negative 
values, flipping the sign of the basis function). Additionally the 
notion of ipsi- and contra- laterality is meaningless for ICA basis 
functions. 

Shapes of basis functions belonging to the binaural sub- 
dictionary were studied by analyzing modulation spectra of their 
left-ear parts. Even though functions were binaural, classification 
according to only the single ear part was sufficient to identify 
subgroups with interesting binaural properties. Centers of mass 
of modulation spectra (for computation details see Materials and 
Methods) are plotted as circles on Figure 6C. Circle color corre- 
sponds to the BSI value. Left parts of binaural features display a 
tradeoff between spectral and temporal modulation. This com- 
plies with the general trend of natural sound statistics (Singh 
and Theunissen, 2003). Dictionary elements were divided into 
three distinctive groups according to their modulation properties 
(marked with roman numerals I, II, III, and separated with dotted 
lines on Figure 6C). The first group consisted of weakly modu- 
lated features with spectral modulation below 0.3 cycles/octave 
and temporal modulation below 4 Hz. Majority of basis functions 
belonging to this group had high BSI, close to 0.9. Three repre- 
sentative members of the first group are displayed on Figure 6D 
in the first row. Since their spectrotemporal modulation is weak, 
they capture constant patterns, similar in both ears, up to the 
sign. The second group consists of basis functions revealing 
strong spectral modulation - above 0.3 cycles/octave. Three exem- 
plary members are visible in the second row of Figure 6D. Basis 
functions belonging to the second group resemble majority of 
ones learned from simulated data. They weight spectral power 
across frequency channels constantly over time. In contrast to 
simulated basis functions, their BSIs are mostly close to 0, indi- 
cating that channel weights do not necessarily have opposite sign 
between ears. Additionally, as visible in two out of three displayed 
examples, low frequencies below 1 kHz are also weighted. 

The third group includes highly temporally modulated fea- 
tures. Their temporal modulation exceeds 4 Hz, while the spectral 
one stays below 0.3 cycles/octave. Out of 15 members of this 
group, only one has a positive BSI value - the rest remains close 
to —1. This implies that when their monaural parts are aligned 
with each other - corresponding dimensions have a similar abso- 
lute value and an opposite sign. Three examplary members of the 
third group are depicted in the last row of Figure 6D. They are 
qualitatively similar to two temporal basis functions learned from 
the simulated data (they represent an envelope comodulation 
across multiple frequency channels with a jt phase difference). 

The temporal differences between monaural parts of basis 
functions were further studied using cross-correlation functions 
(ccf). Maximal values of the normalized ccf are plotted against 
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functions belonging to groups I, II, and III. (E) Temporal cross-correlation 
plotted against its peak value. Color marks the BSI. (F) A histogram of 
temporal shifts maximizing the cross correlation. 



maximizing temporal shifts on Figure 6E. As in the Figure 6C 
- the color of circles represents the BSI value. The histogram 
of temporal shifts is depicted on Figure 6F. Cross-correlation of 
30 binaural features with a positive BSI, is maximized at 0 tem- 
poral shift. In this case, BSI and the peak of cross-correlation 
have the same value. This is a property of basis functions with a 
weak temporal modulation, which constitute a major part of the 
binaural sub-dictionary. Features revealing temporal modulation 
have a negative BSI value (dark colors) and a non-zero temporal 
difference, which spanned the range between —0.2 and 0.2 s. 

3.2.2. Spatial sensitivity of binaural basis functions 

In contrast to the simulated dataset, binaural recordings were 
not labeled with sound source positions. Furthermore, learned 
features may represent dynamic aspects of the object motion, 
therefore conditional histograms (constructed as in the previous 
section) would not be meaningful. 

In order to verify whether binaural basis functions reveal tun- 
ing to spatial position of sound sources and invariance to their 
identity, a test recording was performed. One of the male speak- 
ers read a book out loud, while walking around the head of the 
recording subject, following a circular trajectory in a constant 



pace. This was repeated twice in the anticlockwise and twice in 
the clockwise direction. In such a way, the angular position of 
the speaker was made easy to estimate at each time point. The 
recording was divided into 216 ms overlapping intervals, and each 
interval was encoded using the learned dictionary. A general trend 
in the spatial sensitivity of basis functions was measured by com- 
puting correlation between estimated speaker's position and time 
courses of linear coefficients in the following way. Firstly, activa- 
tion time courses were standarized to have mean equal to 0 and 
variance equal to 1 . In the next step, time intervals where the coef- 
ficient's absolute value exceeded 1 were extracted. This was done, 
since highly sensitive coefficients remained close to 0 most of the 
time, and correlated with the speaker's position only in a narrow 
part of the space (i.e., their receptive field). Elements of the binau- 
ral sub-dictionary correlated stronger with the estimated position 
than elements of the monaural one. Normalized histograms of 
linear correlations between the position of the sound source and 
sparse coefficients are presented on Figure 7. Monaural basis 
functions correlate much weaker with the sound position, which 
is reflected in the strong histogram peak around 0. Binaural coef- 
ficients in turn, reveal strong correlations of the absolute value of 
0.8 in extreme cases. Linear correlation is however not a perfect 
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way to assess relationship between sparse coefficients and the 
source position, since spatial selectivity of basis function may 
be limited to a narrow spatial area (as in Figures 8A,B)- This 
results in correlations of low absolute values, even though spatial 
sensitivity of a basis function may be quite high. To show spa- 
tial selectivity of learned features, their activations were plotted. 
Resulting time courses of basis function activations are displayed 
as black continuous lines on Figure 8. Gray dashed lines mark 
approximated angular position of the speaker at every time point. 

Subfigures (F-J) display activations of 5 representative 
monaural basis functions. As expected, their activity correlates 
very weakly with the speaker's trajectory. Monaural basis func- 
tions encode features of speech and are invariant to the position 
of the speaker. In contrary, activations of binaural basis functions 
visible on subfigures (A-E), reveal strong dependence on sub- 
jects position and direction of motion. Basis function A remains 
non-activated for most of positions and deviates from zero when 
the speaker is crossing the area behind the head of the record- 
ing subject. The slope of activation time courses is informative 
about the direction of speaker's motion. Similar, however noisier, 
spatial tuning is revealed by the basis function D. Basis func- 
tion B displays broader spatial sensitivity, and its activation varies 
smoothly along the circle surrounding the subject's head. Spatial 
information represented by the spectrally modulated basis func- 
tions C and E does not have such a clear interpretation, however 
they display pronounced covariation with sound source's position 
(feature C for instance is strongly positively activated, when the 
speaker crosses directly opposite to the left ear). 

Spatial sensitivity of basis functions can be further quanti- 
fied using Fisher information (for computation details please see 
Materials and Methods). Figure 9 shows Fisher information esti- 
mates as a function of spatial position for features displayed on 
Figure 8. Each binaural basis function reveals a preferred region 
in space where source's position is encoded with higher accu- 
racy. For this reason, histograms depicted on Figures 9A-E can 
be interpreted as an abstract descriptions of auditory SRFs. Basis 
function (A), is most strongly informative about position of 



the sound source behind the head (around 180°), which is also 
reflected in the timecourse of its activation. The Fisher informa- 
tion peaks in visually inaccessible areas also in other, depicted 
basis functions [subfigures (B), (C), (E)]. There, however, the 
peak is not as pronounced as in the first basis function, and sen- 
sitivity to frontal positions is also visible. Fisher information of 
monaural basis functions [subfigures (F)-(J)] does not reveal 
spatial selectivity, is order of magnitude smaller and would most 
probably vanish in the limit of more samples. 

All binaural basis functions presented on Figure 8 are weakly 
temporally modulated. Temporally modulated basis functions, 
do not correlate strongly with the speaker's position (they also 
did not contribute to the position decoding, as described in the 
previous section). 

4. DISCUSSION 

Experimental evidence suggests that redundancy across neurons 
is decreased in the consecutive processing stages in the audi- 
tory system (Chechik et al., 2006). This is directly in line with 
the efficient coding hypothesis. Additionally, a number of studies 
have provided evidence of adaptation of binaural neural circuits 
to the statistics of the auditory environment over different time 
scales. Harper and McAlpine (2004) have argued that proper- 
ties of neural representations of ITDs across many species can be 
explained by a single principle of coding with maximal accuracy 
(maximization of Fisher information). Two recent experimental 
studies provide physiological and psychophysical evidence that 
neural representations of interaural level (Dahmen et al, 2010), 
and time (Maier et al., 2012) disparities are not static but adapt 
rapidly to statistics of the stimulus ensemble. The present study 
contributes to this lines of research, by showing that coding of 
the auditory space can be achieved by redundancy reduction of 
spectrotemporal sound representation. 

The auditory system has to infer the spatial arrangement of 
the surrounding space by analyzing spectrotemporal patterns of 
binaural sound. Auditory SRFs are formed, by extracting signal 
features which correlate well with environment's spatial states and 
result from the head related filtering. Both sound datasets used 
in the present study included two, categorically different vari- 
ability sources: spatial information carried by binaural differences 
resulting from the HRTF filtering and the raw sound waveform. 
Application of ICA - a simple redundancy reducing transform led 
to a separation of those information sources and formation of 
distinct model neuron sub-populations with specific spatial and 
spectrotemporal sensitivity. 

4.1. LINEAR PROCESSING OF SPECTROTEMPORAL BINAURAL CUES 

Emulation of the cochlear processing by performing spectral 
decomposition and application of the logarithmic non-linearity 
produces a data representation well adapted for the position 
decoding task. While it is usually argued that the logarithmic non- 
linearity implemented by mechanical response of the cochlear 
membrane is useful for reducing the dynamical range of the signal 
(Robles and Ruggero, 2001) it provides an additional advantage. 
Since in the frequency domain convolution is equivalent to a 
pointwise product of the signal and the filter (Katznelson, 2004), 
a logarithm transforms it to a simple addition. A linear operation 
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on the "cochlear" data representation suffices to extract features 
imposed by the pinnae filtering (Harper and Olshausen, 2011). 
One should note, however, that in complex listening situations 
involving more than a single, stationary sound source, this simple 
relationship (as described by Equation 11) may be distorted and 
extracted features can be mixing different aspects of the signal. 

It has been observed that a linear approximation of spec- 
trotemporal receptive fields in the auditory cortex predicts their 
spatial selectivity (Schnupp et al., 2001). This result may be 
surprising given that sound localization is a non-linear opera- 
tion (Jane and Young, 2000) and that in a general case, linear 
STRF models do not explain firing patterns of auditory neurons 
(Escabi and Schreiner, 2002; Christianson et al., 2008). Results 
shown in this paper suggest that a linear-redundancy reducing 
transform applied to log-spectrograms suffices to create model 



SRFs, providing a candidate computational mechanism explain- 
ing results provided by Schnupp et al. (2001). Localization of a 
natural sound source involves information included in multiple 
frequency channels. Binaural cues such as ILD are computed in 
each channel separately and have to be fused together at a later 
stage. This is exemplified by temporally constant basis functions 
learned using simulated and natural datasets. They linearly weight 
levels in frequency channel of both ears and in this way form their 
spatial selectivity. Interestingly the weighting is often asymmetric 
(which is reflected by BSI values different from —1). Such pat- 
terns represent binaural level differences coupled across multiple 
frequency channels. A recent study has shown that a similar com- 
putational strategy underlies spatial tuning of binaural neurons 
in the nucleus of the brachium of IC in monkeys (Slee and Young, 
2013). Since it has already been suggested that IC neurons code 
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natural sounds efficiently (Carlson et al., 2012), present results 
extend evidence in support of this hypothesis. 

4.2. COMPLEX SHAPES OF BINAURAL STRFs 

Early binaural neurons localized in the auditory brainstem can 
be classified according to kinds of input they receive from each 
ear (inhibitory-excitatory - IE and excitatory-excitatory - EE) 
(Grothe et al., 2010). At the higher stages of auditory processing 
(Inferior Colliculus, Auditory Cortex), binaural neurons respond 
also to complex spectrotemporal excitation-inhibition patterns 
(Schnupp et al, 2001; Miller et al., 2002; Qiu et al., 2003). The 
present paper suggests, which kinds of binaural features may 
be encoded and used for spatial hearing tasks by higher binau- 
ral neurons. It demonstrates that the reconstruction of natural 
binaural sounds requires basis functions representing various 
spectrotemporal patterns in each ear. The dictionary of learned 
binaural features is best described by a continuous binaural sim- 
ilarity value (in this case Pearson's correlation coefficient - BSI) 
and not by a classification into non-overlapping IE-EE groups. 
Temporally modulated basis functions consitute a particularly 
interesting subset of all binaural ones. Many of them represent a 
single cycle of envelope modulation, in opposite phase in each ear 
(see Figure 6D). The time interval corresponding to such phase 
shift is, however, much larger than the one required for the sound- 
wave to travel between the ears. Their emergence and aspects of 
the environment they represent remain to be explained. Coding 
of different spectrotemporal features in each ear is useful not only 
for sound localization and tracking, but may be also applied for 
separation of sources while parsing natural auditory scenes (i.e., 
solving the "cocktail party problem"). 

4.3. THE ROLE OF HRTF STRUCTURE 

Spatial information is created when the sound waveform becomes 
convoluted with the head and pinnae filter - HRTF. By tak- 
ing into account that this convolution is equivalent to addi- 
tion of the log-spectral representation of the sound and the 
HRTF, one may conclude that the ICA recovers exact HRTF 
forms. A subset of basis functions learned by the ICA model 
from the simulated data could, in principle, contain 24 ele- 
ments, which would constitute an exactly recovered set of HRTFs 
used to generate the training data (see Figure ID). The other 
basis function subset would contain features modeling speech 
variability. This is, however, not the case. Firstly - in the simu- 
lated dataset - HRTFs corresponding to 24 positions were used, 
10 basis functions emerged and only 8 were temporally non- 
modulated, as HRTFs are. Despite such dimensionality reduction, 
information included in the 8 basis functions was sufficient 
to perform the position decoding with 15° spatial resolution. 
This implies that binaural basis functions did not recover HRTF 
shapes but rather formed their compressed representation. It 
is important to note here that learned binaural features were 
much smoother and did not include all spectral detail included 
in HRTFs themselves (compare basis binaural basis functions 
from Figures 2A, 5A with HRTFs from Figure ID). The fact that 
coarse spectral information suffices to perform position decod- 
ing stands in accord with human psychophysical studies. It has 
been demonstrated that HRTFs can be significantly smoothed 



without influencing human performance in spatial auditory tasks 
(Kulkarni and Colburn, 1998). 

In humans and many other species, the area behind the lis- 
tener's head is inaccessible to vision and information about the 
presence or motion of objects there can be obtained only by listen- 
ing. This particular spatial information is of high survival value 
since it may inform about an approaching predator. Interestingly, 
in both used datasets features providing pronounced informa- 
tion about presence of sound sources behind the head clearly 
emerged (see Figures 3B, 8A). Their sensitivity to sound posi- 
tion quantified with Fisher information is highest for the area 
roughly between 160 and 230°. Since those basis functions reflect 
the HRTF structure, one could speculate that the outer ear shape 
(which determines the HRTF) was adapted to make this valubable 
spatial information explicit. It is interesting to think that one of 
the factors in pinnae evolution, was to provide spectral filters, 
highly informative about sound positions behind the head. This, 
however, can not be verified within the current setup and remains 
a subject of the future research. 

4.4. CONCLUSION 

Taken together, this paper demonstrates that a theoretical princi- 
ple of efficient coding can explain the emergence of functionally 
separate neural populations. This is done using an exemplary task 
of binaural hearing. 

In a previous work by Asari et al. (2006), it has already been 
shown that a sparse coding approach may be useful for monaural 
position decoding. Their work, however, did not show separation 
of spatial and identity information into two distinct channels. 
Additionally in contrast to this study (Asari et al., 2006), relied 
on simulated data which did not include patterns such as sound 
motion. 

Learning independent components of binaural spectrograms 
has extracted spatially informative features. Their sensitivity to 
sound source position, was therefore a result of applying a 
general-purpose strategy. This may suggest that seemingly dif- 
ferent neural computations may be instantiations of the same 
principle. For instance, it may appear that auditory neurons with 
spatial and spectrotemporal receptive fields must perform differ- 
ent computations. Current results imply that they may be sharing 
a common underlying design principle — efficient coding, which 
extracts stimulus features useful for performing probabilistic 
inferences about the environment. 
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